-
Notifications
You must be signed in to change notification settings - Fork 4.9k
WIP SqlClient handle socket network error gracefully #37068
Conversation
A quick look shows that you removed TdsParserStateObject from the change and rest of the stuff looks close to original PR. |
Sorry, I should have highlighted the specific change from the previous, it was late. The important change is in corefx/src/System.Data.SqlClient/src/System/Data/SqlClient/SNI/SNIMarsConnection.cs Lines 272 to 283 in 13cd961
sniErrorCode would be set to With this set of changes in place it got through 10 back to back runs of the manual test suite under varying amounts of system load. If there's still a problem with it then I'll need a very specific replication to find it. |
|
||
((TdsParserStateObject)_callbackObject).ReadAsyncCallback(PacketHandle.FromManagedPacket(packet), 1); | ||
if (sniErrorCode == TdsEnums.SNI_WSAECONNRESET || sniErrorCode == TdsEnums.SNI_ERROR) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned about this change in terms of general design. This makes SNI aware of the connection handling mechanisms of TdsParserStateObject now. And now SNI is managing the connection (connection pool) based on the error.
Earlier, HandleReceiveError
was calling the ReadAsyncCallback with 1
viz SNI_ERROR.
In theory, HandleReceiveError has already signalled to the callback, that an error occurred, and it looks like the error should be handled in TdsParserStateObject. But now we are shortcircuiting the job of what TdsParserStateObject
should be doing, in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the point when we've got an error and we know that the packet contains no data should we be calling the receive callback when we haven't received anything? If we do then the problem is that the ReadAsyncCallback checks the error too late (or asserts too early, depends on your point of view) to be useful, I think the assert is valid because at that point it's acting as if there's a successful read and checking to make sure that's the case which leads to the conclusion that we shouldn't get there at all.
Are you proposing to move the logic handling the specific error code out of the mars handle and into the top of the callback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through your findings in the issue itself, where Decrement RefCount was asserting.
The part which I am finding hard to wrap my head around is, that in this fix, we are calling BreakConnection of the common code and making SNI cognizant of the upper layers, which means that we have skipped a bunch of error handling logic, and that's not how this layering with SNI was designed in the first place.
My understanding is that receive callback should be called with an indication that an error occured and the CheckPacket shouldn't check for Data being null. This is because the Debug assert for check packet is used for both successful and unsuccessful cases. And there is no reason why data will be available in unsuccessful cases.
The next thing that I want to understand, is that why the decrement call back is asserting and why its going to a -2
in these cases. (as you have mentioned in the issue). The idea is not to make changes to common code. As you correctly pointed out in the issue, the common code changing much is a big violation of the design.
I feel like this is uncovering problems with the way the common code and managed SNI are interacting. I had fixed one such issue for MARS where I found errors in Managed SNI, where the ref count was being incremented in sync MARS cases, where it shouldn't https://github.com/dotnet/corefx/pull/28427/files
I think there is another bug lurking around in Managed SNI which when coupled with fixing the CheckPacket condition should magically make things work. However I haven't looked deeply enough to say what it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's at least one other bug in the SNI stuff that I'm aware of at the moment. As I work on it I find I'm gradually simplifying and streamlining it which fits with the spirit of change you made in the PR you linked.
While I agree I think fixing the issue this way is better than not fixing it at the moment. I'm still trying to spend time on improving the performance and reliability of the managed implementation but this bug blocks a lot of tests that I could really do with having working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I totally understand why you want to fix this.
Here's something I found. I changed CheckPacket to return true for managed SNI.
If I run just the MARSTest where the Debug.Assert is always hit, /p:XunitClassName=System.Data.SqlClient.ManualTesting.Tests.MARSTest
I find that the SqlConnection.Dispose() is called before the call back is invoked, which causes the DecrementPendingCallbacks to fail. The SqlConnection.Close() when called, decrements the pending callback to 0 and then the receipt of failure message from SNIpacket again decrements the pending callback to -1
at System.Data.SqlClient.TdsParserStateObject.DecrementPendingCallbacks(Boolean release, String method) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 861
at System.Data.SqlClient.TdsParserSessionPool.Dispose() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserSessionPool.cs:line 107
at System.Data.SqlClient.TdsParser.Disconnect() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParser.cs:line 939
at System.Data.SqlClient.SqlInternalConnectionTds.Dispose() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlInternalConnectionTds.cs:line 599
at System.Data.ProviderBase.DbConnectionPool.DestroyObject(DbConnectionInternal obj) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 910
at System.Data.ProviderBase.DbConnectionPool.DeactivateObject(DbConnectionInternal obj) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 882
at System.Data.ProviderBase.DbConnectionPool.PutObject(DbConnectionInternal obj, Object owningObject) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionPool.cs:line 1460
at System.Data.ProviderBase.DbConnectionInternal.CloseConnection(DbConnection owningObject, DbConnectionFactory connectionFactory) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\ProviderBase\DbConnectionInternal.cs:lin
e 270
at System.Data.SqlClient.SqlConnection.CloseInnerConnection() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlConnection.cs:line 637
at System.Data.SqlClient.SqlConnection.Close() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlConnection.cs:line 683
at System.Data.SqlClient.SqlInternalConnectionTds.BreakConnection() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlInternalConnectionTds.cs:line 1747
at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2573
at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2128
at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 2053
at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 1089
at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\TdsParserStateObject.cs:line 1320
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in D:\work\corefx\src\S
ystem.Data.SqlClient\src\System\Data\SqlClient\TdsParser.cs:line 1624
at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlDataReader.cs:line 1075
at System.Data.SqlClient.SqlDataReader.get_MetaData() in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlDataReader.cs:line 249
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2845
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds) in D:\work\corefx\s
rc\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2684
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite, String method) in D:
\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:line 2515
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) in D:\work\corefx\src\System.Data.SqlClient\src\System\Data\SqlClient\SqlCommand.cs:lin
e 2473
This SqlConnection.Dispose() shouldn't happen before this callback is received. This should change the problem statement a little bit, to figure out why this Dispose() is happening before SNI triggers the callback.
If I run the tests against Native SNI, I don't see Dispose being called before the callback invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I know where the issue is.
When SNIMarsHandle.Receive
is called, then the calling thread is sent an error from
corefx/src/System.Data.SqlClient/src/System/Data/SqlClient/SNI/SNIMarsHandle.cs
Lines 445 to 449 in a3ab91e
if (!_packetEvent.Wait(timeoutInMilliseconds)) | |
{ | |
SNILoadHandle.SingletonInstance.LastError = new SNIError(SNIProviders.SMUX_PROV, 0, SNICommon.ConnTimeoutError, string.Empty); | |
return TdsEnums.SNI_WAIT_TIMEOUT; | |
} |
As a result of this issue, the calling thread disposes the connection instead of handling the timeout via the the callback invocation.
The TdsParserStateObject which receives the timeout in Native SNI, is not the one on which callback are being invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole singleton LastError thing makes me nervous. Too much bad experience with globals being used for state in vb6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wraith2 Let me do a bit more digging around this area to provide more pointers and have a concrete explanation. Don't burn your cycles on this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it's already WIP but i can close if that's easier.
@Wraith2 I am reading your comments from https://github.com/dotnet/corefx/issues/33930
This makes sense and should be extrapolated to managed SNI. In case of managed SNI, we should only check that Packet is not null and not check for the data inside the packet. |
I did try that and it just blew up somewhere else, following through that chain was leading me to make lots of changes to shared code and that didn't feel right since native works fine. I think the check though different is correct. |
There is something wrong with Github I merged your other PR and this one shows as merged as well. Need to revert this. |
@Wraith2 There is something strange going on with this PR. It shows the file contents the same as your other PR that got merged. Could you open a new PR again? I also checked your branch(sqlfix-33930) on your remote repo and it seems like it has the same contents as PR 36678 |
I think this is a casualty of the problem I had with my repo trying to rebase on the UDT bytes PR. Since @saurabh is handling this I'll close and delete. |
You tagged the wrong guy. @Wraith2 |
oh, darn, well would you like to deal with this one or shall I leave it with @saurabh500 😁 |
Fixes https://github.com/dotnet/corefx/issues/33930
Then a running the SqlClient test suite in managed mode (Linux, osx, uap) and Debug build, an assert fires indicating that an invalid state has been reached. The assertion indicates that a packet has reached the processing stage and has no contents or that the packet has been received in a state where none was expected. The test failure was caused by an assertion but could lead to silent unexpected non-crashing behaviour in release builds leading to unreliability.
After investigation and debugging I've arrived at three related changes.
The functional and manual tests pass over multiple repeated runs correctly with the affected tests re-enabled.
/cc @afsanehr, @tarikulsabbir, @Gary-Zh , @David-Engel @saurabh500